In this prelab, you will experiment with Principal Components to display the result of clustering on the mtcars dataset. We will do the following standard analyses using PCA with clustering.
- Run the prcomp command and return the results into an object called my.pca.
- Plot a scree plot of my.pca using ‘screeplot’.
- Examine the explained variance with summary(my.pca).

Since the data in the prelab is 3-dimensional, we will also examine the principal component vectors to further understand the plots.
Some hints:
- Use prcomp to do the analysis, with the options retx = TRUE (so prcomp returns the vectors and scalar projections) and center = FALSE, scale. = FALSE (because we will scale the data ourselves).
- Use geom_hline and geom_vline to add the horizontal and vertical lines, respectively, to the ggplot.

Copy prelab4.Rmd to your working directory IDM_work. Use this for your assignment. Do a practice “knit” (to pdf or html) before you begin.
In this exercise, we will work with mtcars using only “mpg”, “cyl”, and “hp”, which are the following measurements of cars: miles per gallon, the number of cylinders, and the horsepower of the car.
# Create a scaled version of the subset of mtcars data
mycars.matrix<-scale(as.matrix(mtcars[,c("mpg","cyl","hp")]))
mycars.df<-as.data.frame((mycars.matrix))
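As a quick sanity check (a base-R sketch, not part of the prelab code), scale() should leave every column with mean 0 and standard deviation 1:

```r
# Rebuild the scaled subset and verify the scaling worked
mycars.matrix <- scale(as.matrix(mtcars[, c("mpg", "cyl", "hp")]))
mycars.df <- as.data.frame(mycars.matrix)

round(colMeans(mycars.df), 10)  # each column mean is numerically zero
apply(mycars.df, 2, sd)         # each column standard deviation is 1
```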
Now we do the PCA analysis, check the proportion of explained variance using the summary command, and then draw the scree plot.
# Run PCA using prcomp
my.pca <- prcomp(mycars.df)
# See proportion of variance explained
summary(my.pca)
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 1.6251 0.47457 0.36585
## Proportion of Variance 0.8803 0.07507 0.04462
## Cumulative Proportion 0.8803 0.95538 1.00000
# scree plot
plot(my.pca, type = "l")
A ‘scree plot’ is a simple line segment plot that shows the variance in the data explained by each PC, with the PCs ordered in decreasing order of variance. In this plot, we see that the first two principal components explain most of the variance. From the summary we see that these two components together explain 95.54% of the variance, and the proportion explained by the first component alone is 0.8803. The curve is essentially flat after two components, so the third component is not valuable for interpretation.
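The percentages reported by summary() can be reproduced directly from the component standard deviations; a minimal sketch:

```r
# Proportion of variance = each component's variance over the total variance
my.pca <- prcomp(as.data.frame(scale(as.matrix(mtcars[, c("mpg", "cyl", "hp")]))))
prop.var <- my.pca$sdev^2 / sum(my.pca$sdev^2)
round(prop.var, 4)          # should match the Proportion of Variance row above
round(cumsum(prop.var), 4)  # should match the Cumulative Proportion row
```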
The prcomp command returns all the parts needed to use PCA.
Try it
- my.pca$rotation contains a matrix of the principal component vectors.
- my.pca$x contains the scalar projections of the data.

# Fool around here to understand the parts.
str(my.pca)
## List of 5
## $ sdev : num [1:3] 1.625 0.475 0.366
## $ rotation: num [1:3, 1:3] -0.5746 0.5875 0.5698 0.648 -0.0987 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:3] "mpg" "cyl" "hp"
## .. ..$ : chr [1:3] "PC1" "PC2" "PC3"
## $ center : Named num [1:3] 7.11e-17 -1.47e-17 1.04e-17
## ..- attr(*, "names")= chr [1:3] "mpg" "cyl" "hp"
## $ scale : logi FALSE
## $ x : num [1:32, 1:3] -0.453 -0.453 -1.424 -0.491 0.964 ...
## ..- attr(*, "dimnames")=List of 2
## .. ..$ : chr [1:32] "Mazda RX4" "Mazda RX4 Wag" "Datsun 710" "Hornet 4 Drive" ...
## .. ..$ : chr [1:3] "PC1" "PC2" "PC3"
## - attr(*, "class")= chr "prcomp"
# Print the principal components found by PCA.
my.pca$rotation
## PC1 PC2 PC3
## mpg -0.5746116 0.64799373 0.4999256
## cyl 0.5874747 -0.09871957 0.8031985
## hp 0.5698200 0.75522088 -0.3239545
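One useful property to verify: the principal component vectors in my.pca$rotation are orthonormal, so the transpose of the rotation matrix times the matrix itself is the identity. A quick check:

```r
my.pca <- prcomp(as.data.frame(scale(as.matrix(mtcars[, c("mpg", "cyl", "hp")]))))
R <- my.pca$rotation
round(t(R) %*% R, 10)  # the 3x3 identity matrix (up to rounding)
```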
# Print the scalar projections of the data on the principal components.
my.pca$x
## PC1 PC2 PC3
## Mazda RX4 -0.45328449 -0.295976512 0.16445088
## Mazda RX4 Wag -0.45328449 -0.295976512 0.16445088
## Datsun 710 -1.42407805 -0.179149731 -0.50539618
## Hornet 4 Drive -0.49142065 -0.252970087 0.19763023
## Hornet Sportabout 0.96410310 0.062159292 0.56602726
## Valiant -0.21835189 -0.662848160 -0.05247479
## Duster 360 1.96536479 0.360139513 -0.12969049
## Merc 240D -1.83426100 -0.348589429 -0.22620604
## Merc 230 -1.40745623 -0.157119706 -0.51484603
## Merc 280 -0.17362990 -0.346310259 -0.04628025
## Merc 280C -0.04015334 -0.496832748 -0.16240799
## Merc 450SE 1.22494059 -0.130052589 0.35162135
## Merc 450SL 1.13913423 -0.033288132 0.42627490
## Merc 450SLC 1.33934907 -0.259071865 0.25208330
## Cadillac Fleetwood 2.00475583 -0.499773647 -0.26419210
## Lincoln Continental 2.08786496 -0.389623519 -0.31144137
## Chrysler Imperial 1.80256493 0.237920746 -0.02563723
## Fiat 128 -2.56374055 0.555599129 0.41848130
## Honda Civic -2.48941254 0.186356823 0.31873352
## Toyota Corolla -2.71506207 0.705858211 0.54762880
## Toyota Corona -1.26689188 -0.274860562 -0.63212878
## Dodge Challenger 1.06141956 -0.557267431 0.41871561
## AMC Javelin 1.09002168 -0.589522250 0.39383110
## Camaro Z28 2.06070519 0.252623450 -0.21263887
## Pontiac Firebird 0.91643290 0.115917324 0.60750146
## Fiat X1-9 -2.07750451 0.007267206 -0.00455544
## Porsche 914-2 -1.74578916 0.142871645 -0.23051150
## Lotus Europa -1.98244683 0.858272607 0.03051298
## Ford Pantera L 1.98026154 0.730698853 -0.09504152
## Ferrari Dino 0.21086739 0.280228441 -0.25050225
## Maserati Bora 2.64660870 1.426751914 -0.49687003
## Volvo 142E -1.15762688 -0.153432014 -0.69712274
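Since the data were already centered by scale(), prcomp’s default centering subtracts (numerically) zero, so the scores in my.pca$x are simply the scaled data multiplied by the rotation matrix. A sketch of that check:

```r
mycars.matrix <- scale(as.matrix(mtcars[, c("mpg", "cyl", "hp")]))
my.pca <- prcomp(as.data.frame(mycars.matrix))
scores <- mycars.matrix %*% my.pca$rotation
max(abs(scores - my.pca$x))  # numerically zero: the two matrices agree
```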
Now we cluster the data using ‘hclust’ to do hierarchical clustering. We plot the dendrogram and decide on 5 clusters.
set.seed(30)
# cluster the car models using hierarchical clustering
my.clust <- hclust(dist(mycars.df), method = "complete")
# show the dendrogram
plot(my.clust)
# cut the dendrogram into 5 clusters and save the cluster vector
my.clustname <- as.factor(cutree(my.clust, k = 5))
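Before plotting, it can be worth checking how the 32 car models are distributed across the 5 clusters, for example with table() (hclust and cutree are deterministic, so the set.seed call is not strictly needed):

```r
# Recreate the clustering and inspect the cluster assignments
mycars.df <- as.data.frame(scale(as.matrix(mtcars[, c("mpg", "cyl", "hp")])))
my.clust <- hclust(dist(mycars.df), method = "complete")
my.clustname <- as.factor(cutree(my.clust, k = 5))
table(my.clustname)  # number of car models in each of the 5 clusters
```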
Now we draw a plot of the scalar projections of the data on the first two principal components. This is called a scoreplot. Each point is colored by its cluster and labeled with its row name. We use the ggbiplot command to draw a scoreplot by setting ‘var.axes = FALSE’. If you drop that part of the command, you get a biplot, which will be covered in the next lab.
# Draw the score plot (ggbiplot requires the ggbiplot and ggplot2 packages)
p <- ggbiplot(my.pca,
              choices = c(1, 2),   # this indicates we want PC1 and PC2
              alpha = .4,          # this makes the points transparent
              var.axes = FALSE,    # this suppresses the variable arrows that appear in a biplot
              labels = rownames(mycars.df),
              groups = as.factor(my.clustname))
p <- p + ggtitle('Mycars Scoreplot of PC1 and PC2') +
  coord_cartesian(xlim = c(-3, 3), ylim = c(-3, 3))  # pick the x and y ranges to put 0 in the middle
## Coordinate system already present. Adding new coordinate system, which will replace the existing one.
# If not knitting to pdf (latex), make an interactive plot using ggplotly; otherwise draw the static plot.
if (out_type != "latex") { ggplotly(p) } else { p }
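If the ggbiplot or plotly packages are not available, a rough base-R version of the same scoreplot can be sketched as follows (a fallback illustration, not part of the prelab code):

```r
# Base-R scoreplot: scores on PC1 and PC2, colored by cluster, labeled by model
mycars.df <- as.data.frame(scale(as.matrix(mtcars[, c("mpg", "cyl", "hp")])))
my.pca <- prcomp(mycars.df)
my.clustname <- as.factor(cutree(hclust(dist(mycars.df), method = "complete"), k = 5))

plot(my.pca$x[, 1], my.pca$x[, 2], col = as.integer(my.clustname), pch = 19,
     xlim = c(-3, 3), ylim = c(-3, 3), xlab = "PC1", ylab = "PC2",
     main = "Mycars Scoreplot of PC1 and PC2")
text(my.pca$x[, 1], my.pca$x[, 2], labels = rownames(mycars.df),
     cex = 0.6, pos = 3, col = as.integer(my.clustname))
abline(h = 0, v = 0, lty = 2)  # horizontal and vertical reference lines through 0
```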
When interpreting a PCA result it can be helpful to examine the principal component vectors. Here is a heatmap, with no scaling or reordering; focus on the first two principal components.
V <- t(my.pca$rotation)  # transpose so the principal components are rows
# heatmap.2 comes from the gplots package
heatmap.2(V, main = 'Principal Components', cexRow = 0.75, cexCol = 0.75,
          scale = "none", dendrogram = "none",
          Colv = FALSE, Rowv = FALSE, tracecol = NA, density.info = 'none')
If PC1 were a car, how would you describe its mpg (miles per gallon), cyl (cylinders), and hp (horsepower)? It would have more cylinders and, as a result, more horsepower; however, it would be worse when it comes to fuel efficiency (mpg).
The Lincoln Continental is a large powerful luxury car (i.e. hp is high). The Honda Civic is an economy car that doesn’t go so fast (i.e. hp is low).
Compute the scalar projection of the Lincoln Continental data point onto the vector PC1, my.pca$rotation[,1]. Verify that the same answer appears in the score matrix: my.pca$x["Lincoln Continental",1]. Then repeat the calculation for the Honda Civic.
# Lincoln Continental
# Data point:
Lincoln_data <- as.numeric(mycars.matrix["Lincoln Continental",])
#Scalar projection:
PC1_vec <- my.pca$rotation[,1]
Lincoln_proj <- (t(Lincoln_data) %*% PC1_vec)[1,1]
Lincoln_proj
## [1] 2.087865
my.pca$x["Lincoln Continental",1]
## [1] 2.087865
# Honda Civic
#Data Point:
Civic_data <- as.numeric(mycars.matrix["Honda Civic",])
#Scalar projection:
PC1_vec <- my.pca$rotation[,1]
Civic_proj <- (t(Civic_data) %*% PC1_vec)[1,1]
Civic_proj
## [1] -2.489413
my.pca$x["Honda Civic",1]
## [1] -2.489413
You’ve now completed Prelab4! Go to LMS and complete the online quiz.
# Quiz on LMS
q3 = c(1,0,0)  # unit vector along the mpg axis
q4 = c(0,0,1)  # unit vector along the hp axis
PC1_vec <- my.pca$rotation[,1]
PC2_vec <- my.pca$rotation[,2]
(t(q3) %*% PC1_vec)[1,1]
## [1] -0.5746116
(t(q4) %*% PC2_vec)[1,1]
## [1] 0.7552209
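Note that projecting a standard basis vector onto a principal component simply picks out one entry of the rotation matrix, which gives a quick way to sanity-check these quiz calculations:

```r
# e.g. the projection of (1,0,0) onto PC1 equals the mpg loading of PC1
my.pca <- prcomp(as.data.frame(scale(as.matrix(mtcars[, c("mpg", "cyl", "hp")]))))
q3 <- c(1, 0, 0)
proj <- (t(q3) %*% my.pca$rotation[, 1])[1, 1]
proj == my.pca$rotation["mpg", "PC1"]  # TRUE: the same number either way
```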
The Appendix lets you dig deeper into R to learn more.